Analysis of Red Wine by Michael Bong

The data set

I have chosen to analyse the red wine data set to identify the key variables that determine wine quality. My aim is to use what I learn here to provide a (slightly) more sophisticated commentary on which red wines are likely to taste good, without actually tasting them!

In this data set, each record represents a type of red wine. The characteristics of each wine are provided (such as fixed acidity levels, pH etc.) and each wine is assigned a quality score (ranging from 0 to 10) based on sensory data.

Initial look at the data

The red wine data set is loaded and the first 10 rows are printed below.

##     X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1   1           7.4             0.70        0.00            1.9     0.076
## 2   2           7.8             0.88        0.00            2.6     0.098
## 3   3           7.8             0.76        0.04            2.3     0.092
## 4   4          11.2             0.28        0.56            1.9     0.075
## 5   5           7.4             0.70        0.00            1.9     0.076
## 6   6           7.4             0.66        0.00            1.8     0.075
## 7   7           7.9             0.60        0.06            1.6     0.069
## 8   8           7.3             0.65        0.00            1.2     0.065
## 9   9           7.8             0.58        0.02            2.0     0.073
## 10 10           7.5             0.50        0.36            6.1     0.071
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   11                   34  0.9978 3.51      0.56     9.4
## 2                   25                   67  0.9968 3.20      0.68     9.8
## 3                   15                   54  0.9970 3.26      0.65     9.8
## 4                   17                   60  0.9980 3.16      0.58     9.8
## 5                   11                   34  0.9978 3.51      0.56     9.4
## 6                   13                   40  0.9978 3.51      0.56     9.4
## 7                   15                   59  0.9964 3.30      0.46     9.4
## 8                   15                   21  0.9946 3.39      0.47    10.0
## 9                    9                   18  0.9968 3.36      0.57     9.5
## 10                  17                  102  0.9978 3.35      0.80    10.5
##    quality
## 1        5
## 2        5
## 3        5
## 4        6
## 5        5
## 6        5
## 7        5
## 8        7
## 9        7
## 10       5

The variables are as follows.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Units for each variable are as follows:

1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
12. quality (score between 0 and 10, based on sensory data)

Univariate Plots Section

This data set has 1599 records across 13 variables.

## [1] 1599   13

The structure of the data is as follows.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

A quick summary of the data set below.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
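The listings above have the shape of output from base R's `head()`, `names()`, `dim()`, `str()` and `summary()`. As a minimal sketch, the ten rows printed earlier are typed in here as CSV text so the calls can be tried without the original file. The file name is not given in this report, so none is assumed. The name `rw` follows the `rw$...` references in the later output.

```r
# First ten rows as printed earlier, entered as CSV text so the example is
# self-contained. With the real data, read.csv() would be given the file path
# instead of the `text` argument.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
4,11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
5,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
6,7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
7,7.9,0.60,0.06,1.6,0.069,15,59,0.9964,3.30,0.46,9.4,5
8,7.3,0.65,0.00,1.2,0.065,15,21,0.9946,3.39,0.47,10.0,7
9,7.8,0.58,0.02,2.0,0.073,9,18,0.9968,3.36,0.57,9.5,7
10,7.5,0.50,0.36,6.1,0.071,17,102,0.9978,3.35,0.80,10.5,5"

rw <- read.csv(text = csv_text)

print(head(rw, 10))   # first 10 rows
print(names(rw))      # variable names
print(dim(rw))        # number of rows and columns
str(rw)               # structure: type of each column
print(summary(rw))    # min, quartiles, mean and max per column
```

On this ten-row sample `dim(rw)` reports 10 rows rather than 1599, and the whole-number columns are read as integers.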

A quick look at the histogram for each variable below, with summary statistics ( min = 1st red line, q1 = 1st blue line, median = Orange line, q3 = 2nd blue line, max = 2nd red line, mean = Green line ).
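A histogram with these reference lines could be drawn along the following lines. This is only a sketch in base R graphics: the plotting code used for the report is not shown, and the helper name `hist_with_stats` is hypothetical. The sample values are the ten fixed acidity figures printed earlier.

```r
# Draw a histogram of x and overlay its summary statistics as vertical lines,
# using the colour scheme described in the text. Returns the statistics
# invisibly so they can be inspected afterwards.
hist_with_stats <- function(x, label) {
  stats <- c(min    = min(x),
             q1     = unname(quantile(x, 0.25)),
             median = median(x),
             q3     = unname(quantile(x, 0.75)),
             max    = max(x),
             mean   = mean(x))
  cols <- c(min = "red", q1 = "blue", median = "orange",
            q3 = "blue", max = "red", mean = "green")
  hist(x, main = paste("Histogram of", label), xlab = label)
  abline(v = stats, col = cols[names(stats)], lwd = 2)
  invisible(stats)
}

# Ten fixed acidity values from the first rows of the data set
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
stats <- hist_with_stats(fixed_acidity, "fixed acidity")
print(stats)
```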

Let’s look at each variable in turn.

Fixed acidity

Fixed acidity is not normally distributed. It has a long tail as there are some large acidity figures, although with low frequency.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## [1] "Standard deviation is:  1.7410963181277"

The log10 plot looks closer to a normal distribution. The standard deviation is also smaller, although that mainly reflects the change of scale.

## [1] "Standard deviation log10 scale is:  0.0773478330552048"
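The two standard deviation figures can be reproduced in outline as follows. This is a sketch on the ten fixed acidity values printed earlier, not the report's own code. The `+ 1` offset inside `log10()` is an assumption, chosen so that zero values (citric acid has several) stay finite. Because a standard deviation depends on the scale of the data, the Shapiro-Wilk statistic from `shapiro.test()` is added as an optional scale-free check, where values of W closer to 1 indicate a shape closer to normal.

```r
# Ten fixed acidity values from the first rows of the data set
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Assumption: log10(x + 1), so that a value of 0 would map to 0
log_fixed_acidity <- log10(fixed_acidity + 1)

print(summary(fixed_acidity))

sd_raw <- sd(fixed_acidity)
sd_log <- sd(log_fixed_acidity)
print(paste("Standard deviation is: ", sd_raw))
print(paste("Standard deviation log10 scale is: ", sd_log))

# Scale-free normality check on both versions of the variable
w_raw <- unname(shapiro.test(fixed_acidity)$statistic)
w_log <- unname(shapiro.test(log_fixed_acidity)$statistic)
print(c(raw = w_raw, log10 = w_log))
```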

Volatile acidity

Volatile acidity is not normally distributed. It has a long tail as there are some large acidity figures, although with low frequency. It is closer to a normal distribution than fixed acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## [1] "Standard deviation is:  0.179059704153535"

The log10 plot looks slightly closer to a normal distribution. The standard deviation is smaller too, although that mainly reflects the change of scale.

## [1] "Standard deviation log10 scale is:  0.0499117519457995"

Citric acid

Citric acid has a long-tailed distribution, as with the previous 2 variables. It is not normally distributed and the distribution is stretched by large acid levels for several red wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000
## [1] "Standard deviation is:  0.194801137405319"

Attempts to normalise the distribution of citric acid reversed the direction of the long tail, from right skewed (bulk of observations at the low end) to left skewed (bulk of observations at the high end).

## [1] "Standard deviation log10 scale is:  0.0661964066049957"
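The direction of a skew can be checked numerically with the moment coefficient of skewness. A positive value means a long tail to the right (bulk of observations at the low end), and a negative value means a long tail to the left. Base R has no built-in skewness function, so the helper below is a small hand-written sketch. The sample is the ten citric acid values printed earlier, and the `+ 1` offset is an assumption that keeps the zeros finite under `log10()`.

```r
# Moment coefficient of skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Ten citric acid values from the first rows of the data set
citric_acid <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36)

skew_raw <- skewness(citric_acid)
skew_log <- skewness(log10(citric_acid + 1))  # assumption: log10(x + 1)

print(c(raw = skew_raw, log10 = skew_log))
```

On the full data set the sign of each value would show which way the tail points before and after the transformation.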

Residual sugar

Residual sugar has the bulk of its observations at the lower end, with a lot of outliers. This results in a strongly right-skewed, long-tailed distribution. We will attempt to normalise this using a log10 plot.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
## [1] "Standard deviation is:  1.40992805950728"

The log10 plot is closer to a normal distribution, but it is still quite right skewed.

## [1] "Standard deviation log10 scale is:  0.11724613311005"

Chlorides

Similar to residual sugar, chlorides has the bulk of its observations at the lower end, with a lot of outliers. This results in a strongly right-skewed, long-tailed distribution. We will attempt to normalise this using a log10 plot.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
## [1] "Standard deviation is:  0.0470653020100901"

The log10 plot is closer to a normal distribution, but is still rather right skewed.

## [1] "Standard deviation log10 scale is:  0.0169336059155253"

Free sulfur dioxide

Free sulfur dioxide is quite similar in distribution to chlorides and residual sugar, although it is less extreme. It is right skewed, has sizeable outliers and is not normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## [1] "Standard deviation is:  10.4601569698097"

Normalising with a log10 plot has worked, giving a more even distribution of observations and a greatly reduced outlier count.

## [1] "Standard deviation log10 scale is:  0.270908515351267"

Total sulfur dioxide

The distribution of total sulfur dioxide is very similar to free sulfur dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
## [1] "Standard deviation is:  32.8953244782991"

Using a log10 plot, we were able to normalise the distribution and greatly reduce the number of outliers, as with the previous variable.

## [1] "Standard deviation log10 scale is:  0.296438379192448"

Density

Density looks normally distributed, with some outliers. Most observations are concentrated between 0.995 and 1. Mean and median are nearly identical (0.9967 and 0.9968).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037
## [1] "Standard deviation is:  0.00188733395384256"

Using a log10 plot has made this variable even more normally distributed.

## [1] "Standard deviation log10 scale is:  0.00041048390118351"

pH

Much like density, pH is very close to normally distributed. Mean and median are nearly identical (3.311 and 3.310) and the pH of most wines lies between 3 and 3.5, with some outliers. 75% of wines have a pH of 3.4 or less (quite acidic).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
## [1] "Standard deviation is:  0.154386464903543"

Using a log10 plot has made this variable even more normally distributed.

## [1] "Standard deviation log10 scale is:  0.0155302548325435"

Sulphates

Similar to total and free sulfur dioxide, sulphates is right skewed, with the bulk of observations at the lower end of the sulphates scale. It is long tailed and not normally distributed, with a substantial number of large, infrequent outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
## [1] "Standard deviation is:  0.16950697959011"

Using a log10 scale, the distribution of sulphates is more binomial, with a marked reduction in number of outliers. Mean and median are closer together.

## [1] "Standard deviation log10 scale is:  0.0407069650816158"

Alcohol

Alcohol is also right skewed, with the bulk of observations at the lower end of the alcohol scale. 75% of wines have an alcohol content of 11.1% or less.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## [1] "Standard deviation is:  1.06566758184739"

Using a log10 scale, alcohol is closer to a normal distribution, with minimal outliers remaining.

## [1] "Standard deviation log10 scale is:  0.0392750529614007"

Quality

Quality is roughly normally distributed, with a slight left skew. At least 50% of all wines analysed have a score of 5 or 6, with 8 being the highest score. This means that this analysis will be based mostly on average quality wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## [1] "Standard deviation is:  0.807569439734705"
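Since quality takes only a handful of integer values, a frequency table shows its concentration around 5 and 6 more directly than quartiles do. The sketch below uses the ten quality scores printed earlier, so the share it reports applies to that sample only, not to all 1599 wines.

```r
# Ten quality scores from the first rows of the data set
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

counts <- table(quality)       # number of wines per score
shares <- prop.table(counts)   # same, as proportions

# Share of wines with an average score (5 or 6)
share_5_6 <- mean(quality >= 5 & quality <= 6)

print(counts)
print(round(shares, 2))
print(share_5_6)
```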

Additional variable: Total acidity

Total acidity is a calculated variable (Fixed acidity + Volatile acidity). As a combination of both fixed and volatile acidity, it has a similar distribution to them.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.680   8.445   8.847   9.740  16.285
## [1] "Standard deviation is:  1.70404692804485"
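The calculated variable is a plain column-wise sum. A minimal sketch on the ten rows printed earlier follows. The column name `total.acidity` is an assumption that follows the naming style of the other columns.

```r
# Fixed and volatile acidity for the first ten rows of the data set
rw <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
)

# New variable: total acidity = fixed acidity + volatile acidity
# (citric acid is deliberately left out, as in the text)
rw$total.acidity <- rw$fixed.acidity + rw$volatile.acidity

print(summary(rw$total.acidity))
print(paste("Standard deviation is: ", sd(rw$total.acidity)))
```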

Using a log10 scale, total acidity is closer to a normal distribution, with fewer outliers remaining.

## [1] "Standard deviation log10 scale is:  0.0718503550795018"

Univariate Analysis

What is the structure of your dataset?

The dataset has 1599 different wines, with characteristics recorded across 13 variables (one of which, X, is simply a row index). All the variables are numerical in nature.

What is/are the main feature(s) of interest in your dataset?

Quality is of greatest interest, and it forms the anchor for the upcoming analysis. My aim is to determine how the other variables drive the quality of wine up or down.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

From past experience and through quick research, it seems features such as alcohol content, pH, residual sugar and acid content result in wines of distinct tastes and textures (which, when combined and tasted, elicit either positive or negative responses).

Did you create any new variables from existing variables in the dataset?

A new variable called total acidity was created (Fixed acidity + Volatile acidity). This is done so that we have a general acidity variable to analyse. Citric acid is not included in this new variable.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Based on the distribution of quality, our analyses will be based mostly on wines with an average quality rating (between 5 and 6). This will make it difficult to determine what combination of levels within each variable makes very good, or very bad scoring wines.

Quite a few of the variables were not normally distributed, contrary to expectations of more normal distributions. There were extreme outliers for some of these variables. In most cases where the distribution is long tailed, the log10 scale adjustment is used to make the observations more normally distributed.

Bivariate Plots Section

Firstly, we look at correlation between the variables. Quality has relatively strong correlation with alcohol (positive correlation of 0.476) and volatile acidity (negative correlation of -0.391).

Quality has a slight but notable correlation with citric acid (positive correlation of 0.226) and sulphates (positive correlation of 0.251).

Initial expectations that pH and residual sugar would impact quality will need to be revisited as both have very low correlation with quality.
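Pairwise correlations like these can be read off a correlation matrix. The sketch below uses `cor()` on a handful of columns from the ten rows printed earlier, so its coefficients will differ from the full-data figures quoted above. It only illustrates the mechanics.

```r
# A few columns from the first ten rows of the data set
rw <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns
cor_matrix <- cor(rw)

# Correlation of each predictor with quality, strongest positive first
predictors  <- rownames(cor_matrix) != "quality"
quality_cor <- sort(cor_matrix[predictors, "quality"], decreasing = TRUE)
print(round(quality_cor, 3))
```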

We will look at the interaction between quality of wines and each of the potential predictor variables below.

The following applies to the plots below:

- Median: Orange line
- Mean: Green line
- Linear smoothing: Dashed blue line
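One way such a plot could be put together in base R is sketched here. The report's actual plotting code is not shown, so the layout (quality on the horizontal axis, jittered points) and the helper name `plot_by_quality` are assumptions. Only the line colours follow the legend. The sample is the alcohol and quality values from the ten rows printed earlier.

```r
# Scatter plot of a variable against quality, with the per-score median
# (orange), per-score mean (green) and a linear fit (dashed blue).
plot_by_quality <- function(x, quality, label) {
  means   <- tapply(x, quality, mean)
  medians <- tapply(x, quality, median)
  scores  <- as.numeric(names(means))

  plot(jitter(quality), x, xlab = "quality", ylab = label, col = "grey40")
  lines(scores, medians, col = "orange", lwd = 2)
  lines(scores, means, col = "green", lwd = 2)
  abline(lm(x ~ quality), col = "blue", lty = 2)

  invisible(list(mean = means, median = medians))
}

# Alcohol and quality for the first ten rows of the data set
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

res <- plot_by_quality(alcohol, quality, "alcohol (% by volume)")
print(res$mean)
print(res$median)
```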

Quality & Fixed acidity

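Fixed acidity and quality are slightly positively correlated (cor = 0.124). The mean and median fixed acidity tend to be higher for wines with higher quality scores, although the trend is not strictly monotonic (the mean for a score of 3 is 8.36, above the means for scores of 4 to 6).

The log10 plots show the same pattern, albeit with fewer outliers.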

## 
##  Pearson's product-moment correlation
## 
## data:  rw$fixed.acidity and rw$quality
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.225  12.600
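The listing above has the shape of output from `cor.test()` followed by `by()` with `summary` as the function applied to each quality score. A minimal sketch on the ten rows printed earlier is shown here. With so few rows the coefficient and p-value will differ from the full-data figures, and only scores 5, 6 and 7 appear.

```r
# Fixed acidity and quality for the first ten rows of the data set
rw <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Pearson correlation test (the default method of cor.test)
ct <- cor.test(rw$fixed.acidity, rw$quality)
print(ct)

# Summary statistics of fixed acidity within each quality score
by_quality <- by(rw$fixed.acidity, rw$quality, summary)
print(by_quality)
```

The degrees of freedom reported by the test are the number of rows minus 2, which is why the full data set shows `df = 1597`.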

Quality & Volatile acidity

Volatile acidity and quality are negatively correlated (cor = -0.391). Wines of better quality have a markedly lower mean and median volatile acidity content.

The relationship between volatile acidity and pH is discussed in a later section.

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$volatile.acidity and rw$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Quality & Citric acid

Citric acid and quality are positively correlated (cor = 0.226). Wines of better quality have a markedly higher mean and median citric acid content.

The relationship between citric acid and pH is discussed in a later section.

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$citric.acid and rw$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Quality & Residual sugar

Residual sugar and quality have a very small, negligible positive correlation (cor = 0.0137). Wines generally have residual sugar content of 0 to 3.1. This variable has minimal impact on quality of wine.

However, we should consider the interaction between density and residual sugar (covered in the Quality & Density section), which suggests it may have an indirect impact on quality of wine.

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$residual.sugar and rw$quality
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

Quality & Chlorides

Chlorides and quality are negatively correlated (cor = -0.129). Wines of better quality have a lower mean and median chloride content. The bulk of the wines have a chloride content between 0 and 0.1.

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$chlorides and rw$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

Quality & Free sulfur dioxide

Free sulfur dioxide and quality have a very small, negligible negative correlation (cor = -0.0507). This variable has minimal impact on quality of wine. Interestingly, the lower quality wines (with scores of 3, 4) and the higher quality wines (with scores of 7, 8) have low levels of free sulfur dioxide. Free sulfur dioxide prevents microbial growth and wine oxidation and is added to wine to improve its aging potential.

The log10 plots show the same pattern with fewer outliers.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$free.sulfur.dioxide and rw$quality
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00

Quality & Total sulfur dioxide

Similar to free sulfur dioxide, total sulfur dioxide and quality have a small negative correlation (cor = -0.185). This variable has a small impact on quality of wine. Interestingly, the lower quality wines (with scores of 3, 4) and the higher quality wines (with scores of 7, 8) have low levels of total sulfur dioxide.

The log10 plots show the same pattern with fewer outliers.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$total.sulfur.dioxide and rw$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

Quality & Density

Density and quality are negatively correlated (cor = -0.175). Density has a negative correlation with alcohol content (cor = -0.496) and a positive correlation with residual sugar content (cor = 0.355).

Wines of better quality have a lower mean and median density level. This is consistent with higher quality wine having a higher alcohol content. The link to residual sugar is weaker, as residual sugar itself has almost no correlation with quality (cor = 0.0137).

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$density and rw$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0032 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

Quality & pH

pH and quality have a very small negative correlation (cor = -0.0577). Wines generally have pH levels of 3.2 to 3.5. This variable has a small direct impact on quality of wine.

However, we should consider the interaction between pH and both citric acid and volatile acidity, which suggests it may have an indirect impact on quality of wine. pH has a negative correlation with citric acid (cor = -0.542) and a positive correlation with volatile acidity (cor = 0.235). As higher quality wine has lower volatile acidity, this suggests that it has lower pH. Higher quality wine also has higher citric acid content, which again suggests lower pH.

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$pH and rw$quality
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.163   3.230   3.267   3.350   3.720

Quality & Sulphates

Sulphates and quality are positively correlated (cor = 0.251).

Sulphates contribute to sulfur dioxide gas, which prevents microbial growth and wine oxidation, and are added to wine to improve its aging potential. Given this, I expected a high correlation between sulphates and both total and free sulfur dioxide levels; however, the correlation in both cases is negligible.

Wines of better quality have a markedly higher mean and median sulphate content. This may enable the wines to age better, and thus taste better.

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$sulphates and rw$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000
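The per-score summaries above have the layout that base R's by() prints when summary() is applied to one column within each level of another. A minimal sketch of that pattern is shown here on a small made-up data frame (rw_toy is hypothetical and is not the red wine data), so the numbers it prints are illustrative only.

```r
# Hypothetical toy data standing in for the red wine data frame
rw_toy <- data.frame(
  quality   = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  sulphates = c(0.50, 0.58, 0.66, 0.55, 0.64, 0.75, 0.65, 0.74, 0.83)
)

# Min, quartiles, median and mean of sulphates within each quality score
by_quality <- by(rw_toy$sulphates, rw_toy$quality, summary)
print(by_quality)

# Group medians on their own, for a quick comparison across scores
medians <- tapply(rw_toy$sulphates, rw_toy$quality, median)
print(medians)
```

Comparing the medians as well as the means is useful here because a few very high sulphate values (the maxima near 2.0 in the real output) pull the means upwards.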

Quality & Alcohol

Alcohol content and quality are positively correlated (cor = 0.476). Wines of better quality have a markedly higher mean and median alcohol content. Most of the wines with quality score of 7+ have alcohol content of 10.8% and above.

Higher alcohol content also goes with lower density (i.e. a lighter feel to the wine), as the two variables are negatively correlated (cor = -0.496).

The log10 plots show the same pattern.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$alcohol and rw$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Quality & Total acidity

Total acidity and quality have a very small positive correlation (cor = 0.0857). This variable has minimal impact on quality of wine.

The log10 plots show the same pattern, with fewer outliers.

## 
##  Pearson's product-moment correlation
## 
## data:  rw$total.acidity and rw$quality
## t = 3.4378, df = 1597, p-value = 0.0006015
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03684298 0.13416675
## sample estimates:
##        cor 
## 0.08570932
## rw$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.460   8.051   8.883   9.245  10.460  12.180 
## -------------------------------------------------------- 
## rw$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.120   7.380   8.185   8.473   9.070  12.960 
## -------------------------------------------------------- 
## rw$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.520   7.735   8.390   8.744   9.490  16.260 
## -------------------------------------------------------- 
## rw$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.300   7.605   8.400   8.845   9.881  14.610 
## -------------------------------------------------------- 
## rw$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.320   7.880   9.110   9.276  10.485  16.285 
## -------------------------------------------------------- 
## rw$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.420   7.625   8.730   8.990  10.530  12.910

Relevant variables

The following variables are deemed relevant and could be useful in predicting the quality of wines.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found that alcohol and volatile acidity have a sufficient relationship with quality. To a slightly lesser extent, sulphates, citric acid, density and chlorides are also related to quality. All the listed variables will be useful as predictors of wine quality in a predictive model.

These features were chosen as they had a sufficient correlation with wine quality.

I am keen to explore the relationship between residual sugar and density, and also the relationship between pH and citric acid, and how these could impact quality too.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Interestingly, although volatile acidity and citric acid are flagged as useful predictors, fixed and total acidity do not share the same relationships with wine quality.

Sulphates is useful as a wine quality predictor, but the same is not true for free and total sulfur dioxide.

What was the strongest relationship you found?

The strongest relationship was between alcohol levels and wine quality, with a correlation of 0.476.

Multivariate Plots Section

The approach I will take here is to focus first on the variables with the highest absolute correlation to quality.

For each of the focus variables, I will then pair it with the other (non-quality) variables with which it has an absolute correlation of at least 0.20.
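As a rough sketch of the ordering step, the correlations with quality reported in this analysis can be ranked by absolute value. The vector below is typed in by hand from those reported figures, and the object names are illustrative assumptions.

```r
# Correlations with quality as reported in this analysis
cor_quality <- c(
  alcohol          =  0.476,
  volatile.acidity = -0.391,
  sulphates        =  0.251,
  citric.acid      =  0.226,
  density          = -0.175,
  chlorides        = -0.129,
  total.acidity    =  0.0857,
  pH               = -0.0577
)

# Rank by strength of relationship, ignoring the sign
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
print(ranked)

# Pairing rule: keep a partner variable when |cor| is at least 0.20,
# shown here for the reported correlations with alcohol
cor_alcohol <- c(volatile.acidity = -0.202, density = -0.496,
                 chlorides = -0.221, pH = 0.206)
partners <- names(cor_alcohol)[abs(cor_alcohol) >= 0.20]
print(partners)
```

The first six names in the ranking give the order in which the focus variables are taken up in the pairs that follow.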

Pair 1: Alcohol - Volatile acidity, Density, Chlorides, pH

Alcohol has the strongest correlation with wine quality (cor = 0.476).

Alcohol has a correlation of -0.202 with volatile acidity. Higher wine quality has higher alcohol content and lower volatile acidity.

Differing wine qualities have a relatively clear distinction in volatile acidity levels.

Action: Add volatile acidity as predictor.

Alcohol has a correlation of -0.496 with density. Higher wine quality has higher alcohol content and lower density.

Differing wine qualities have a relatively clear distinction in density levels.

Action: Add density as predictor.

Alcohol has a correlation of -0.221 with chlorides. Higher wine quality has higher alcohol content and lower chlorides.

Differing wine qualities have a relatively clear distinction in chlorides levels.

Action: Add chlorides as predictor.

The plot is recreated with chlorides under a log10 scale. The patterns are more pronounced.

Alcohol has a correlation of 0.206 with pH. Higher quality wine has higher alcohol content and relatively lower pH, although pH levels tend to increase as alcohol levels increase.

Differing wine qualities have a relatively clear distinction in pH levels.

Action: Add pH as predictor.

Pair 2: Volatile acidity - Sulphates, Citric acid, pH

Volatile acidity has a sufficient correlation with wine quality (cor = -0.391).

Volatile acidity has a correlation of -0.261 with sulphates. Higher wine quality has lower volatile acidity and higher sulphates.

Differing wine qualities have a relatively clear distinction in sulphates levels.

Action: Add sulphates as predictor.

The plot is recreated with sulphates under a log10 scale. The patterns are more pronounced.

Volatile acidity has a correlation of -0.552 with citric acid. Higher wine quality has lower volatile acidity and slightly higher citric acid.

Differing wine qualities have a relatively clear distinction in citric acid levels.

Action: Add citric acid as predictor.

Volatile acidity has a correlation of 0.235 with pH. Higher wine quality has lower volatile acidity and slightly lower pH levels.

Differing wine qualities have a relatively clear distinction in pH levels.

Action: Add pH as predictor.

Pair 3: Sulphates - Citric acid, Chlorides

Sulphates has a sufficient correlation with wine quality (cor = 0.251).

Sulphates has a correlation of 0.313 with citric acid. Higher wine quality has higher sulphates and slightly higher citric acid levels.

Differing wine qualities have a relatively clear distinction in citric acid levels.

Action: Add citric acid as predictor.

Sulphates has a correlation of 0.371 with chlorides. Higher wine quality has higher sulphates and lower chlorides levels.

Differing wine qualities have a relatively clear distinction in chlorides levels.

Action: Add chlorides as predictor.

Pair 4: Citric acid - Density, Chlorides, pH

Citric acid has a sufficient correlation with wine quality (cor = 0.226).

Citric acid has a correlation of 0.365 with density. Higher wine quality has slightly higher citric acid levels and lower density.

Differing wine qualities have a relatively clear distinction in density levels.

Action: Add density as predictor.

Citric acid has a correlation of 0.204 with chlorides. Higher wine quality is not really impacted by citric acid levels and has lower chlorides.

Differing wine qualities have a relatively clear distinction in chlorides levels.

Action: Add chlorides as predictor.

Citric acid has a correlation of -0.542 with pH. Higher wine quality is not really impacted by citric acid levels and has slightly lower pH levels.

Differing wine qualities have a relatively clear distinction in pH levels.

Action: Add pH as predictor.

Pair 5: Density - Chlorides, pH, Residual sugar

Density has a sufficient correlation with wine quality (cor = -0.175).

Density has a correlation of 0.201 with chlorides. Higher wine quality has lower density and lower chlorides.

Differing wine qualities have a relatively clear distinction in chlorides levels.

Action: Add chlorides as predictor.

Density has a correlation of -0.342 with pH. Higher wine quality has lower density and slightly lower pH levels.

Differing wine qualities have a relatively clear distinction in pH levels.

Action: Add pH as predictor.

Density has a correlation of 0.355 with residual sugar. Higher wine quality has lower density and seems to have slightly higher residual sugar content.

Differing wine qualities do not have a relatively clear distinction in residual sugar levels.

Action: Remove residual sugar as predictor.

Pair 6: Chlorides - pH

Chlorides has a sufficient correlation with wine quality (cor = -0.129).

Chlorides has a correlation of -0.265 with pH. Higher wine quality has lower chlorides and has slightly lower pH levels.

Differing wine qualities have a relatively clear distinction in pH levels.

Action: Add pH as predictor.

Prediction model

Based on the previous analyses, I have decided to use the following variables as predictors for wine quality: Alcohol, Volatile acidity, Sulphates, Density, Chlorides, pH, Citric acid.

The following variable has been excluded as a predictor: Residual sugar.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates, 
##     data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     density, data = rw)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     density + chlorides, data = rw)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     density + chlorides + pH, data = rw)
## m7: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + sulphates + 
##     density + chlorides + pH + citric.acid, data = rw)
## 
## ======================================================================================================================
##                          m1            m2            m3            m4            m5            m6            m7       
## ----------------------------------------------------------------------------------------------------------------------
##   (Intercept)           1.875***      3.095***      2.611***     -4.820        -5.677         4.521        -5.885     
##                        (0.175)       (0.184)       (0.196)      (10.387)      (10.335)      (10.708)      (11.930)    
##   I(alcohol)            0.361***      0.314***      0.309***      0.316***      0.300***      0.305***      0.321***  
##                        (0.017)       (0.016)       (0.016)       (0.018)       (0.019)       (0.019)       (0.020)    
##   volatile.acidity                   -1.384***     -1.221***     -1.219***     -1.164***     -1.069***     -1.193***  
##                                      (0.095)       (0.097)       (0.097)       (0.097)       (0.101)       (0.119)    
##   sulphates                                         0.679***      0.664***      0.857***      0.848***      0.848***  
##                                                    (0.101)       (0.103)       (0.112)       (0.112)       (0.112)    
##   density                                                         7.392         8.411        -0.506        10.281     
##                                                                 (10.331)      (10.281)      (10.561)      (11.886)    
##   chlorides                                                                    -1.653***     -1.929***     -1.782***  
##                                                                                (0.394)       (0.401)       (0.407)    
##   pH                                                                                         -0.419***     -0.534***  
##                                                                                              (0.120)       (0.134)    
##   citric.acid                                                                                              -0.269*    
##                                                                                                            (0.136)    
## ----------------------------------------------------------------------------------------------------------------------
##   R-squared             0.227         0.317         0.336         0.336         0.343         0.348         0.350     
##   adj. R-squared        0.226         0.316         0.335         0.334         0.341         0.346         0.347     
##   sigma                 0.710         0.668         0.659         0.659         0.655         0.653         0.653     
##   F                   468.267       370.379       268.912       201.751       166.599       141.816       122.332     
##   p                     0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1621.814     -1599.384     -1599.127     -1590.346     -1584.293     -1582.343     
##   Deviance            805.870       711.796       692.105       691.882       684.325       679.163       677.509     
##   AIC                3448.114      3251.628      3208.768      3210.255      3194.692      3184.587      3182.686     
##   BIC                3464.245      3273.136      3235.654      3242.518      3232.332      3227.604      3231.080     
##   N                  1599          1599          1599          1599          1599          1599          1599         
## ======================================================================================================================

The original model has an R-squared value of 0.35, meaning that it can explain 35% of variability in wine quality.
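The fit statistics in the table can be cross-checked against each other. The sketch below assumes the usual definitions for a linear model (adjusted R-squared penalised by the number of predictors, and AIC/BIC computed from the log-likelihood with the residual variance counted as one extra parameter). The helper info_criteria is a name introduced here for illustration, and the inputs are the rounded values copied from the table.

```r
# Values copied from the model table above
n <- 1599

# Adjusted R-squared for m7 (7 predictors), from the rounded R-squared
r2_m7 <- 0.350
p_m7  <- 7
adj_r2_m7 <- 1 - (1 - r2_m7) * (n - 1) / (n - p_m7 - 1)
print(round(adj_r2_m7, 3))

# AIC and BIC from the log-likelihood; k counts every coefficient
# (including the intercept) plus the residual variance
info_criteria <- function(loglik, n_coef, n) {
  k <- n_coef + 1
  c(AIC = -2 * loglik + 2 * k,
    BIC = -2 * loglik + log(n) * k)
}

ic_m3 <- info_criteria(-1599.384, n_coef = 4, n = n)  # without density
ic_m4 <- info_criteria(-1599.127, n_coef = 5, n = n)  # with density
print(round(ic_m3, 3))
print(round(ic_m4, 3))
```

Both criteria rise from m3 to m4, which is consistent with the non-significant density coefficient in the table: adding density costs a parameter without a matching gain in fit.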

We can increase this as follows. First, Density does not appear to add to the predictive power of the original model, so it is removed from the revised model. Second, we use log10 scales for sulphates and chlorides, as this brings their distributions closer to normal, reduces the influence of outliers and improves predictive power.

The revised model has an improved R-squared value of 0.36.

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates), 
##     data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) + 
##     citric.acid, data = rw)
## m5: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) + 
##     citric.acid + log10(chlorides), data = rw)
## m6: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + log10(sulphates) + 
##     citric.acid + log10(chlorides) + pH, data = rw)
## 
## ========================================================================================================
##                          m1            m2            m3            m4            m5            m6       
## --------------------------------------------------------------------------------------------------------
##   (Intercept)           1.875***      3.095***      3.369***      3.444***      3.080***      4.842***  
##                        (0.175)       (0.184)       (0.184)       (0.196)       (0.218)       (0.449)    
##   I(alcohol)            0.361***      0.314***      0.303***      0.303***      0.283***      0.302***  
##                        (0.017)       (0.016)       (0.016)       (0.016)       (0.017)       (0.017)    
##   volatile.acidity                   -1.384***     -1.156***     -1.217***     -1.107***     -1.110***  
##                                      (0.095)       (0.097)       (0.112)       (0.116)       (0.115)    
##   log10(sulphates)                                  1.477***      1.518***      1.716***      1.742***  
##                                                    (0.177)       (0.181)       (0.188)       (0.187)    
##   citric.acid                                                    -0.113        -0.013        -0.276*    
##                                                                  (0.103)       (0.106)       (0.121)    
##   log10(chlorides)                                                             -0.487***     -0.564***  
##                                                                                (0.132)       (0.132)    
##   pH                                                                                         -0.595***  
##                                                                                              (0.133)    
## --------------------------------------------------------------------------------------------------------
##   R-squared             0.227         0.317         0.345         0.346         0.352         0.360     
##   adj. R-squared        0.226         0.316         0.344         0.344         0.349         0.357     
##   sigma                 0.710         0.668         0.654         0.654         0.651         0.647     
##   F                   468.267       370.379       280.646       210.808       172.704       148.984     
##   p                     0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1621.814     -1587.752     -1587.153     -1580.350     -1570.342     
##   Deviance            805.870       711.796       682.108       681.597       675.822       667.415     
##   AIC                3448.114      3251.628      3185.503      3186.306      3174.699      3156.683     
##   BIC                3464.245      3273.136      3212.389      3218.569      3212.339      3199.700     
##   N                  1599          1599          1599          1599          1599          1599         
## ========================================================================================================
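To show the mechanics of putting a log10 term in a model formula, here is a minimal sketch on simulated data. The data frame toy, its coefficients and its sample size are all hypothetical and are not the red wine data, so the printed numbers say nothing about the real models above.

```r
set.seed(1)

# Hypothetical simulated data; NOT the red wine data set
n <- 200
toy <- data.frame(
  alcohol   = runif(n, 8.5, 14),
  sulphates = exp(rnorm(n, log(0.65), 0.3))  # right-skewed, always positive
)
toy$quality <- 3 + 0.3 * toy$alcohol + 1.5 * log10(toy$sulphates) +
  rnorm(n, sd = 0.6)

# Same predictor on the raw scale and on the log10 scale
fit_raw <- lm(quality ~ alcohol + sulphates, data = toy)
fit_log <- lm(quality ~ alcohol + log10(sulphates), data = toy)

print(summary(fit_raw)$r.squared)
print(summary(fit_log)$r.squared)
print(coef(fit_log))
```

A log10 coefficient is read per tenfold change in the predictor, so the sulphates coefficients in the two tables above are not directly comparable in size.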

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The strongest relationships were between wine quality and its two most strongly correlated variables, alcohol and volatile acidity. The plots clearly show that wine quality increases with higher alcohol levels and lower volatile acidity.

Other variable combinations that strengthened each other were as follows: (alcohol and density), (alcohol and chlorides), (sulphates and volatile acidity), (density and citric acid). This is judged by the level of distinction between the different quality scores each combination of variables can predict.

Were there any interesting or surprising interactions between features?

Alcohol and density had an interesting interaction. I was surprised that density decreased when alcohol levels increased. Incidentally, higher alcohol levels and lower density are associated with better scoring wines.

This suggests that drinkers like a lighter-textured and stronger drink!

However, based on our linear model, it seems that density adds little predictive power once alcohol is included as a predictor.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

A linear model was created to predict wine quality. The revised model uses the following variables as predictors: Alcohol, Volatile acidity, log10(Sulphates), log10(Chlorides), pH, Citric acid.

It is able to explain 36% of variability in wine quality, which is not very high. This is likely caused by the relatively low correlations between predictors and wine quality.


Final Plots and Summary

Plot One

Description One

This first plot illustrates that a majority of the wines in the data set have a quality score of 5 or 6 (i.e. average quality wine). The first and third quartiles of the data fall within the scores of 5 and 6, meaning at least 50% of wines in the data are from these quality groups.

This improves our understanding of what levels of each variable make average quality wine.

If the data set were more even, with a much higher representation of very low (scores 3 to 4) and very high (scores 7 to 8) quality wines, we would be able to learn more about what levels of each variable make very good or very bad wines.

Plot Two

Description Two

The second plot shows the relationship between wine quality and its most strongly correlated variable, alcohol level.

There is a positive correlation where the higher the alcohol levels, the higher the quality of the wine. There is a clear increase in mean and median of alcohol levels as the quality of wine increases.

Plot Three

Description Three

The third plot shows the combined relationship between alcohol and chloride levels and quality. The combination of these variables is important as it has a clear relationship with wine quality. The lower the chloride levels and the higher the alcohol content, the higher the quality of wine.

This also implies that drinkers tend to prefer stronger and less salty wines!

The combination of these two variables has good predictive qualities for determining wine quality.


Reflection

I ran into difficulty trying to decide the best way to visualise the data. I explored the many different plot options and struggled to choose a single plot, as I thought each of the plots helped improve understanding of the data set. I finally decided to use multiple plots to visualise a single variable. Box, scatter and line plots used in combination have been really helpful in this analysis.

I also ran into difficulty with the distribution of some of the variables (such as sulphates and chlorides). They had really long tails and were skewed, and these tend to hide patterns/relationships of these variables. I eventually overcame this by using a log10 scale on both these variables.

I found success and was happy to find some key predictors of wine quality such as alcohol levels, volatile acidity, sulphates, chlorides and citric acid. They have a good level of correlation with quality and, when used as predictors in a linear prediction model, they can explain 36% of variability in wine quality. The flipside of this is that an expanded dataset (as elaborated further below) could potentially uncover some stronger predictors and improve the prediction model.

Given the nature of the data set, where a majority of the wines analysed were of average quality, it reduces my confidence in the predictive power of any models built from this data set. It would be good to be able to expand this data set to include both very good and very bad wines.

It would be good to be able to expand on the data set from a quality rating perspective. Currently, we have a single quality score, which may be based on multiple criteria such as taste, texture and price, to name a few. It would be good to bring those additional variables (and potentially useful predictors) into the data set. This will open the door for us to analyse things like what drives wine prices and also what combination of chemical properties results in different wine textures.

Overall, I feel that I know more about red wine and the different variables to pick up on when trying to determine what would be a good quality red wine.

